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Abstract 

A multivariate PLS-QSAR study with a data set of 31 cinnamoyl pyrrolidine 
derivatives described as type 2 matrix metalloproteinases (MMP-2) inhibitors is 
presented in this paper. The variable selection was performed with the Ordered 
Predictors Selection (OPS) algorithm. The PLS model presented six descriptors 
and three Latent Variables (LV) that cumulated 71 .845% of variance. Leave-N- 
out (LNO) cross validation and y-randomization tests showed that the model 
presented robustness and no chance correlation, respectively. The descriptors 
indicated that MMP-2 inhibition depends mainly on the electronic properties of 
the compounds. The model obtained can be useful as a support tool in the 
design of new MMP-2 inhibitors. 
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Introduction 

The matrix metalloproteinases (MMPs) are a family of enzymes that are intimately involved 
in tissue remodeling. These zinc-containing endopeptidases consist of subsets of 
enzymes, and they are involved in the degradation of the extracellular matrix (ECM) that 
forms the connective material between cells and around tissues. In pathologic conditions 
an increase of MMP activity occurs, leading to tissue degradation [1]. 
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Currently, about 27 MMPs are known. Their overexpression is associated with several 
diseases: cancer, cardiovascular diseases (including congestive heart failure), 
osteoarthritis, rheumatoid arthritis, chronic obstructive pulmonary disease, psoriasis, 
dermatitis, Alzheimer's disease and periodontitis, among others [1,2]. Thus, MMPs are 
currently an interesting target for drug design. However, despite the great amount of 
research, the tetracycline doxycycline (Fig. 1) is the only MMP inhibitor available in 
therapeutics. This longer-acting antibiotic also presents a weak inhibition of collagenases 
(MMPs-1, 8 and 13), and it is currently marketed for clinical treatment of chronic 
periodontal disease [3-5]. 




Fig. 1. Structural formula of doxycycline hyclate (Periostat®, CollaGenex 
Pharmaceuticals). 



Among the MMPs, MMP2 and MMP9 are named gelatinases. These enzymes are able to 
degrade a broad range of matrix substrates, including gelatin, type IV collagen of basal 
laminae, as well as other nonhelical collagen domains and proteins, such as fibronectin 
and laminin, that constitute cellular connective tissue and are strongly involved in both 
normal and pathological tissue remodeling [1,6]. The overexpression of this subclass, 
especially MMP2, is found to be strongly correlated to an aggressive malignant phenotype, 
and it presents poor prognosis for several types of aggressive cancer, such as ovarian, 
lung, breast, bladder and gastric cancers [6-8]. Thus, MMP2 inhibitors have been studied 
as a target for anticancer drug design. 

Quantitative structure-activity relationship (QSAR) describes how a given biological activity 
can vary as a function of molecular descriptors derived from the chemical structure of a set 
of molecules. A model containing those calculated descriptors can be used to predict 
responses from new compounds, constituting an important tool to support the synthesis of 
new drugs [9, 10]. Thus, considering the continuous need for new anticancer drugs, a 
QSAR study based on 31 cinnamoyl pyrrolidine derivatives (Table 1) synthesized and 
assayed by Zhang et al. [8] was carried out. The dataset was obtained through a 
hybridization approach between the L-hydroxyproline scaffold, the MMPs substrate, the 
cinnamic acid, an inhibitor of the A5491 human lung gland cancer, and the caffeic acid, an 
MMP-2 inhibitor (Fig. 2). The aim was obtaining a mathematical model that could be used 
for prediction of the inhibitory potency of new cinnamoyl pyrrolidine derivatives against 
MMP-2. 
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Tab. 1. Selected data set of cinnamoyl pyrrolidine derivatives and their respective 
inhibition potencies against MMP-2. 




Compound 3 
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R 3 


IC 50 (nM) 


PIC 50 D 


AO 
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H 


11.2 


7.951 


A1 


H 


H 


CH 3 CO 


128.4 


6.891 


A2 


H 


H 


CH 3 CH 2 CO 


98.1 


7.008 


A3 


H 


H 


CH 3 CH 2 CH 2 CO 


85.6 


7.068 


A4 


H 


H 


PhCO 


52.4 


7.281 


A5 


H 


H 


p-CI-PhCO 


31.8 


7.498 


A6 


H 


H 


2,3,4-(OCH 3 ) 3 -PhCO 


259.5 


6.586 


A7 


H 


H 


PhCH 2 CH 2 CO 


43.6 


7.361 


A8 


H 


H 


PhCH=CHCO 


5.2 


8.284 


A9 


H 


H 


p-CH 3 0-Ph-CH=CHCO 


12.3 


7.910 


A10 


H 


H 


3,4-(OCH 3 ) 2 -PhCH=CHCO 


13.1 


7.883 


BO 


H 


CH 3 


H 


439.8 


6.357 


B1 


H 


CH 3 


CH 3 CO 


316.4 


6.500 


B2 


H 


CH 3 


CH 3 CH 2 CO 


280.2 


6.553 


B3 


H 


CH 3 


CH 3 CH 2 CH 2 CO 


195 


6.710 


B4 


H 


CH 3 


PhCO 


109.9 


6.959 


B5 


H 


CH 3 


p-CI-PhCO 


42.8 


7.369 


B6 
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CH 3 


2,3,4-(OCH 3 ) 3 -PhCO 


562.6 


6.250 


B7 
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CH 3 


PhCH 2 CH 2 CO 


73.4 


7.134 


B8 
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CH 3 


PhCH=CHCO 


39.1 


7.408 


B9 
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CH 3 


p-CH 3 0-Ph-CH=CHCO 


7.8 


8.108 


B10 


H 


CH 3 


3,4-(OCH 3 ) 2 -PhCH=CHCO 


121.3 


6.916 
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CH 3 


CH 3 


CH 3 CO 


320.2 


6.495 
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CH 3 


CH 3 


CH 3 CH 2 CO 


293.4 
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C3 


CH 3 


CH 3 


CH 3 CH 2 CH 2 CO 


221.1 


6.655 


C4 


CH 3 


CH 3 


PhCO 


201.2 


6.696 


C5 


CH 3 


CH 3 


p-CI-PhCO 


111.8 


6.952 


C7 


CH 3 


CH 3 


PhCH 2 CH 2 CO 


168.3 


6.774 


C8 


CH 3 


CH 3 


PhCH=CHCO 


86.5 


7.063 


C9 


CH 3 


CH 3 


p-CH 3 0-Ph-CH=CHCO 


28.7 


7.542 


C10 


CH 3 


CH 3 


3,4-(OCH 3 ) 2 -PhCH=CHCO 


9.7 


8.013 



a same identification used the original work [8]; p p!C 50 = -log IC 50 . 
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L-hydroxyproline Cinnamic acid Caffeic acid 

Fig. 2. Structures of L-hydroxyproline, cinnamic acid and caffeic acid. 

Results and Discussion 

The study was carried out using the QSAR Modeling [11]. The variable selection with the 
Ordered Predictors Selection (OPS) algorithm [12-15] generated a model based on three 
Latent Variables (LV) that cumulate 71.845% of variance (LV1: 18.043%; LV2: 31.298%; 
LV3: 22.504%). These LV derivate from six selected descriptors: SOFT (softness), 
EEig02r (eigenvalue 02 from edge adjacent matrix weighted by resonance integrals), a xx 
(the component vector to the overall polarizability in the x-axis), qlONBO (partial charge of 
the atom #10 calculated through Natural Bond Orbitals approach), q2NBO (partial charge 
of the atom #2 calculated through Natural Bond Orbitals approach) and SsssN(oth) 
(E-state index for amino group attached to functional groups not aliphatic or aromatic). The 
values of each descriptor are available in the Supporting Information, Table S1. The 
standardized regression coefficients are -0.549 for EEig02r, 0.545 for SOFT, 0.377 for a xx , 
0.238 for qlONBO, 0.250 for q1 NBO, and -0.314 for SsssN(oth). According to Wold [16], 
regression coefficients larger than about half the maximum regression coefficient value 
indicate that the descriptor is significant for the PLS-QSAR model. Thus, the reference 
value is 0.274. The coefficients of q2NBO and qlONBO are lower than this value, but its 
removal decreases the statistical quality of the model. Thus, these descriptors can be 
considered important for the model. In addition, the maximum difference is only 0.036 
units, which is very low. Thus, both descriptors were maintained in the model. 

Fig. 3 shows the studentized residuals (a) versus the leverage samples plot, and it was 
used for the identification of outliers. No compound presented residuals higher than 2.5xa. 
Only one compound presented leverage higher than the leverage cutoff line, but it can be 
considered acceptable [17]. Therefore, the model can be considered free of outliers, 
something which guarantees the maximum possible representation in terms of structure 
and range of inhibitory activity for the dataset under study. 

The model (Equation I) explains 78.324% (fl^O.783) and predicts 61.844% {Cf LOO =0^8) 
of variance. The predicted values in the cross-validation step and the residuals are 
available in the Supporting Information, Table S2. The difference between the values of R 2 
and C?\_oo was 0.165 units. A large difference between R 2 and Q^oo exceeding 0.2-0.3 is 
a clear indication that the model suffers from overfitting [18]. Thus, this difference may be 
considered acceptable. The F value (32.521) was higher than the corresponding tabled 
value (p=3 and n-p-1=27) with a 95% confidence interval (ct=0.05). The value of PRESS va \ 
was smaller than SS y , another indicator of the statistical significance of the prediction [16]. 
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Eq. 1. plC 50 = 0.394(SOFT) -2.1 98(EEig02r) + 0.014(a xx ) + 80.105(q10NBO) + 

1 1 .339(q2NBO) - 9.21 8(SsssN(oth)) + 64.222 

n=31; fl^O.783; SEC=0.276; F (3 ,27)=32.521 (cF=2.960); 0100=0.618; 
SEV=0.342; PF?ESS va ,=3.621 (SS y =9.491). 
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Fig. 3. Outlier detection plot. The figure was built in QSAR Modeling [11]. 



The results obtained from y-randomization [19] analysis and LNO cross-validation [20] are 
available in Figs. 4 and 5. The y-randomization aids in verifying the possibility that the 
explained and predicted variances are due to chance correlation [19]. It can be observed 
that the results obtained for all randomized models have a bad quality when compared to 
the original model, because the intercepts are within the acceptable values recommended 
in literature, i.e., below 0.3 (Fig. 4A) and 0.05 (Fig. 4B). These results indicate that the 
variance explained by the model was not due to chance correlation. 




0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 



/<yrand,y) /tyrand,y) 

Fig. 4. Results of y-randomization test (A and B). The "r(yrand,y)" values in the x-axis 
are presented in absolute values. Figure built from the results generated in 
QSAR Modeling [11]. 
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LNO cross-validation (Fig. 5) employs smaller training sets than the LOO cross-validation, 
and it can be repeated several times, because of the large number of combinations that 
rise when more than one compound is left out from the training set, once at a time. A 
QSAR model can be considered robust when the average values of C^lno are relatively 
high and close to C^loo [19]. The model obtained in this study has an average Q^lno 
(0.604), only 0.014 units lower than C^loo- The standard deviation for each "AT (performed 
in hexaplicate) value is small, with the maximum of 0.055 for Cf L4 o- 
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Fig. 5. Results of LNO cross validation. The bars in the graphic represent standard 
deviations with regard to six tests for each "AT value. Figure built from the 
results generated in QSAR Modeling [11]. 



Some studies show that only externally validated models may be considered realistic and 
applicable for drug design [21-24]. The real model (II) was obtained after the split of data 
in training (n=26) and test (n=5) sets. The standardized regression coefficients of each 
descriptor are -0.579 for EEigOIx, 0.599 for SOFT, 0.362 for a xx , 0.149 for q10NBO, 0.322 
for q1 NBO, and -0.278 for SsssN(oth). The model (II) has statistical parameters similar to 
those for the auxiliary model (i.e., Eq. 1). Therefore, they can be considered equivalent 
and can be used in the external validation. 

Eq. 2. plC 50 = 0.450(SOFT) - 2.293(EEig01x) + 0.013(a xx ) + 61 .930(q1 0NBO) + 

14.508(q2NBO) - 8.637(SsssN(oth)) + 55.156 

n=26; fl^O.809; SEC=0.264; F (3 ,22)=31 .089 (cF=3.049); 0^00=0.626; 
SEV=0.340; PF?ESS va i=3.000 (SSy=8.026). 

Results obtained for the external validation (Table 2) show that the model has high 
external prediction power, considering the proposed limits. R 2 pre 6, tool used as a measure 
of the model's external predictive power, was higher than the adopted threshold {R 2 ^ = 
0.641 > 0.5), and the associated error (SEP) with this parameter may be considered low. 
The Golbraikh-Tropsha statistics [25, 26] aid to confirm the prediction power of the model. 
Both values of k and /('and the relation \F?o-R' 2 o\ are within acceptable ranges (0.85 < x< 
1.15, where x= /cor k', and \F? 0 -Fl 2 o\ < 0.3). 
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Tab. 2. Results from external validation step performed through the real model (II). 
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-0.373 
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7.542 
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fl 2 pred 


0.641 






SEP 


0.325 
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1.017 






k' 


0.981 






\f?o-R' 2 o\ 


0.004 







It can be observed that the obtained model has reasonable internal and external quality. 
However, it is always desirable to obtain a model that is able to relate the physicochemical 
properties represented by the selected molecular descriptors to the action mechanism of 
the system under study [27]. Zhang et al. [8] described the experimental structure-activity 
relationships of the data set, highlighting the importance of heteroatoms (especially the 
hydroxil group) to form hydrogen bonds, and tt electrons to facilitate interactions with 
hydrophobic regions of the receptor, and a slight decrease in inhibitory potency with the 
addition of methoxyl to Ri and R2. Furthermore, a docking study indicated that the ester 
carbonyl (atom #20) could bind with the zinc located in the active site, the lateral chain 
represented in this paper by R3 bind with the S1 ' cavity, and the lateral chain attached to 
the nitrogen bind with the S1 cavity. A representation of the metalloproteinases active site 
[28, 29] is presented in Fig. 6. 

The SOFT, a quantum chemical descriptor, was calculated using the relation 
SOFT=1/GAP, where GAP is the difference between the energies (calculated at B3LYP/6- 
31 1 (d,p) theory level) of lowest unoccupied molecular orbital and highest occupied 
molecular orbital (E L umo _ Ehomo)- These molecular descriptors are known to be related to 
molecular reactivity. Generally, softer molecules are more reactive [26, 30]. As the SOFT 
coefficient is positively correlated to pICso, this indicates that derivatives with high value for 
this descriptor will react more easily. The histogram presented in Fig. 7 shows exactly this 
trend: considering the 16 most active compounds, only four (A2, A3, A7, and AO) have 
SOFT < 5. The compounds found among the most active have a greater tendency to 
present many heteroatoms (oxygen and chlorine) and tt electrons in the substituent R3, in 
agreement with the experimental structure-activity relationships discussed by Zhang et al. 
[8], probably by facilitating the interaction with the enzyme via hydrogen and hydrophobic 
bonds. Thus, similar to what was proposed by Liu et al. for a set of ct-glucosidase inhibitors 
[30], the inhibitory activity would be expected to be improved by introducing more 
heteroatoms and electrons tt in the structure of new derivatives. 
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Fig. 6. Representation of the binding site of metalloproteinases, highlighting the most 
important points (sites S1 , S1 S2, and Zn ). The structure used corresponds 
to MMP-3 (stromelysin-1), PDB 1 D7X [28]. The figure was built in Accelrys 
Discovery Studio Visualizer 2.5 [29]. 
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Fig. 7. Histogram presenting the SOFT of dataset in relation to MMP-2 inhibitory 
potency. 



The EEig02r, which presents a negative coefficient, is an edge adjacency index, a 
topological descriptor derived from the edge adjacency matrix, also called bond matrix, 
which encodes the connectivity between graph edges [26,31]. In this approach, as in 
many other graph theoretical representations of chemical structures, the vertices of the 
molecular graph represent atoms and edges represent bonds in molecules. The edge 
adjacency index with this weighting scheme is sensitive to the presence of heteroatoms 
and multiple bonds in the molecule [26]. This class of descriptors can be weighted by 
several different atomic properties. The most interesting aspect of the presence of a 
weighted-resonance index in the model is that this weighting scheme turns the descriptor 
more sensitive to the presence of heteroatoms and multiple bonds in the molecule [26]. 
So, its selection by OPS algorithm may be, again, related to the importance of 
heteroatoms and tt electrons in the R3 substituent. 



Sci Pharm. 2012; 80: 265-281 



A QSAR Study of MMP-2 Inhibitors 



273 



The a xx , calculated in the Marvin 4.1.8 [32] through a method based on the empiric model 
proposed by Miller and Savchik [33], describes the ability of a molecule to be polarized in 
the X Cartesian axis. The signal of the coefficient is positive, indicating that the 
improvement of the polarization in this plane is favorable to the activity. In Fig. 8 it is 
possible to see that the x-axis always crosses the frontal region of the structures. The size 
of R3 substituent causes a slight shift in the position of the axis, as it can be seen in the 
compounds CO (low potent) and C10 (high potent). This information can be related to the 
interpretation proposed for the SOFT, since the presence of a greater number of 
heteroatoms and tt electrons in R 3 increase the polarization of this Cartesian axis. 




plC M = 6.357 PIC M - 8-013 



Fig. 8. Cartesian axes' representation for compounds BO and C10. The z-axis is 
located perpendicular to the plane of projection. 

The q2NBO and qlONBO are atomic charges descriptors calculated using the Natural 
Bond Orbital (NBO) theory. The charges measure the extent of electronic density 
localization in a molecule. Negative q n values mean that there is excess electronic charge 
in the atom while positive values mean that the atom is electron-deficient [26]. It is possible 
to observe that the charge of atom #2 undergoes a slight increase in electron density (see 
Supporting Information, Table S1) in subsets B and C, probably due to an electron donor 
effect resulting from the insertion of the methoxyl at positions R1 and R2. This effect was 
more pronounced in the subset B (only R 2 substituent) than in the subset C (substituents 
at R1 and R2). Interestingly, the compounds of subset A are generally more potent than 
their corresponding in subsets B and C, which have, in general, higher electron densities 
in the atom #1. It can be proposed, since the sign of its coefficient is positive, that an 
electron donor effect caused by the insertion of the methoxyl in the aromatic ring 
decreases its electron density, hampering the interaction of this group with the S1 site of 
MMP-2. This same effect can be observed, in a less pronounced manner, in the atom #10, 
the nitrogen of pyrrolidine ring, since the descriptor qlONBO also has a positive coefficient. 

The SsssN(oth) is an atom type E-state (electrotopological state) index, and it also 
corresponds to the nitrogen from the pyrrolidine ring. The E-state formalism considers that 
each atom or bond has an intrinsic state, which is disturbed by every other atom or bond in 
the molecule. This state encodes information about the electronic distribution (as a 
variation caused by all other atoms) and topological aspects (major/minor accessibility of 
atoms and bonds to the external environment), and how such information can influence 
intermolecular interactions [26, 34]. Since this descriptor is also related to the atom #10, 
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this indicates that, although the most important point of structural variation for the activity is 
the R3 substituent, other parts of the molecule also influence the activity. The pyrrolidine 
nitrogen, for example, is close to the ester carbonyl side chain, the binding point with the 
zinc atom located in the active site of MMP-2. The negative coefficient indicates that the 
decrease of this descriptor is favorable to the activity. Among the dataset, the lowest 
SsssN(oth) values are in the A subset (Supporting Information, Table S1). This subset has 
no substituents in R1 and R2 (Table 1). Thus, it may indicate that these substitutions also 
affect the intrinsic value of nitrogen, as well as the partial charge descriptor qlONBO, 
influencing the interactions that this part of the molecule can have with the binding site of 
MMP-2. 

Interestingly, the three most important descriptors (EEig02r, SOFT and ct xx ), considering 
the standardized coefficients of the real model (Eq. 2), are related exactly to the R3 
substituents, the main point of structural variation in the dataset, and it is therefore 
primarily responsible for the variation in inhibitory potency. This result strengthens the 
importance of hydrogen and hydrophobic bonds to S1' binding site of MMP-2, and 
demonstrates how the manipulation of this characteristic in structurally related compounds 
can be useful in the design of new cinnamoyl pyrrolidine derivatives able to inhibit MMP-2. 

Conclusion 

The model obtained using the OPS, an algorithm for variable selection, showed a 
statistically significant internal and external prediction power. In addition, the LNO cross- 
validation shows the model is robust, and in the y-randomization test it shows the model 
does not present chance correlation. The selected descriptors suggest that the presence 
of heteroatoms, especially, and tt electrons in the R 3 substituent can be important for the 
binding of compounds to the regions S1' of the binding site of MMP-2, but the handling of 
electronic distribution in the side chain attached to the pyrrolidinic nitrogen, which binds to 
the S1 site, can also be exploited for the design of new active derivatives. The 
manipulation of these features can assist in obtaining new lead compounds that can be 
useful for developing new drugs used in the chemotherapy for treating aggressive cancers. 

Experimental 

Molecular Modeling 

Three-dimensional structures were built using HyperChem 7 [35] from the structure 
ZINC40405643, obtained in the ZINC Database (http://zinc.docking.org) [36]. Calculations 
of MM+ force field were carried out using the same software. The most stable 
conformations obtained were further optimized at AM1 semi-empirical quantum 
mechanical method, followed by Hartree-Fock level (HF/6-31G(d)) and Density Functional 
Theory (DFT) level (B3LYP/6-31 1G(d,p)) using Gaussian 09 [37]. The DFT/B3LYP was 
chosen as method for obtaining the geometries and electronic properties because it leads 
to quite satisfactory results in the analysis with such aims [9, 10]. 

Molecular descriptors 

The SMILES strings [38] of each compound were used to obtain E-state indices in the 
Parameter Client [39]. The optimized geometries were used to obtain, in the Dragon 3.0 
Web Version [31], the following classes of descriptors: constitutional descriptors, functional 
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groups counts, charge descriptors, molecular properties, walk and path counts, information 
indices, edge adjacency indices, topological charge indices, topological descriptors, 
connectivity indices, 2D autocorrelations, Burden eigenvalues, and eigenvalue-based 
indices. The optimized geometries were also used to obtain the electronic descriptors in 
the Gauss View 5 [40]. Partial charges of the basic structure were calculated by means of 
two approaches: Mulliken Charges and Natural Bond Orders [41]. In the Marvin 4.1.8 [32], 
it was obtained the molecular polarizability (a) and the respective vectorial components 
(ct xx , a yy and a zz ). After removal of missing, invariants, and quasi-invariants descriptors 
calculated in the Dragon 3.0, a total of 439 molecular descriptors were available for use. 

Mathematical method 

The partial least squares (PLS), a classical chemometric method, was employed to 
explore the quantitative relationships between the training set and MMP-2 inhibition. In this 
calibration method, LV are obtained including the dependent variable (in this case, pICso) 
in the analysis in such a way that the covariance between the projection of the samples in 
the new axis system (also orthogonal) and the dependent variable is maximized [42, 43]. 
For this, descriptors should be preprocessed using the autoscaling scheme (columnwise 
mean-centered and scaled to unity variance). Thus, they can be compared to each other 
on the same scale. 

Variable selection 

The step of variable selection in a QSAR study is a way to identify reduced subsets of 
descriptors that in fact reproduce the observed values of a biological activity, i.e. those that 
are the most useful to obtain a more accurate prediction model. The use of a good variable 
selection method helps to obtain the subset to reach an optimal mathematical equation for 
the prediction of the activity under study and, therefore, simple, robust, and more easily 
interpretable models [44, 45]. In this study, a two-step procedure was employed: (i) the 
439 original descriptors were reduced to 81 by eliminating those that presented the 
absolute value of Pearson's correlation coefficient (|r|) with pICso lower than 0.3; and (ii) 
the ordered predictor selection (OPS) algorithm [12-15] was used to select the most 
important descriptors. OPS is able to build PLS models by rearranging the columns of the 
matrix in such a way that the most important descriptors, classified according to an 
informative vector (available options: correlation vector, regression vector and an element- 
wise product between both), are placed in the first columns. Then, successive PLS 
regressions are performed with an increasing number of descriptors to find the best model. 
In this work, the three informative vectors were used. The best models were classified in 
descending order of statistical quality according to their coefficient of determination of 
leave-one-out cross validation (Ctloo) or standard error of cross validation (SEV) values. 
OPS is implemented in QSAR Modeling [11], a free JAVA-based software developed by 
the courtesy of the Theoretical and Applied Chemometrics Laboratory's research group 
(http://lqta.iqm.unicamp.br). 

Model validation 

Several statistical tools (see Supporting Information) are suggested in literature for 
validation of QSAR models. For the internal quality, the adopted parameters were the 
coefficient of multiple determination of calibration (Fr), standard error of calibration (SEC), 
F-ratio test with a 95% confidence interval (F, ct=0.05) C/ioo, SEV and predictive residual 
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sum of squares of validation {PRESS va \) [18]. The adopted limits are R 2 > 0.6 and Q^oo > 
0.5. SEC and SEV values should be as low as possible. For PRESS ya \, values should be 
lower than the sum of squares of the response values (SSy) [19]. F-test value should be 
higher than the tabled F value (F An -p-i, where n is the number of compounds and p is the 
number of LV) and the higher the difference between them, the more statistically 
significant is the model [46]. 

The robustness of the model was examined through leave-A/-out (LNO) cross validation, 
with A/=1 to 7. This test was repeated three times for each "A/" value. All rows from the data 
matrix and respective y values were randomized in each step of LNO process. It is 
expected that the average value of each C^lno would be close to Cftoo (coefficient of 
multiple determination of leave-one-out cross validation) with standard deviations close to 
zero [21]. The possibility of chance correlation was tested using y-randomization test, 
where only the y vector (pICso) was scrambled 10 times. The approach suggested by 
Eriksson et al. [20], based on the \r\ between the original vector y and the randomized 
vectors y, was used to quantify chance correlation. In this approach, two regression lines 
are built using these correlation coefficients (x-axis) and the R 2 and C^loo values (y-axis). 
The intercepts of the equations obtained in the linear regression should be lower than 0.3 
for R 2 and 0.05 for Q^oo- 

Once internally validated, the data set was split into training set (n=26) and test set (n=5), 
generating the real model [18]. The test set was selected manually, in such a way that the 
entire range of pICso (6.25 to 8.208, 1 .958 logarithmic units) and the structural variations of 
the data set were well represented. A dendrogram obtained for the complete data set by 
Hierarchical Cluster Analysis (HCA) [47] (Supporting Information, Fig. S1) aid to confirm 
that the selected compounds are suitable as test set. Thus, a structurally representative 
test set could be formed by the compounds B2 (plC 5 o=6.553), C4 (plC 5 o=6.696), C5 
(plC 50 =6.952), C9 (plC 50 =7.542), and AO (plC 50 =7.951). The HCA analysis are performed 
in Pirouette 4 [48]. 

The parameter coefficient for multiple determination of prediction [Ffpred) and standard 
error of external prediction (SEP) was used as a measure of the predictive power of a 
QSAR model. The recommended limit is P^ pre d > 0.5 [49], and SEP values also should be 
as low as possible. However, this is not enough to guarantee that the model is really 
predictive. It is also recommended to check: (i) the slopes k or k' of the linear regression 
lines between the observed activity (y) and the predicted activity in the external validation 
(y e , ), where the slopes should be 0.85 < x< 1.15 {x = kor k); and (ii) the absolute value of 

the difference between the coefficients of multiple determination, Ff 0 and R 2 o, smaller 
than 0.3 [26, 27]. 
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